Construction and utilization of bilingual speech corpus for simultaneous machine interpretation research

Authors

  • Hitomi Tohyama
  • Shigeki Matsubara
  • Nobuo Kawaguchi
  • Yasuyoshi Inagaki
Abstract

This paper describes the design, analysis, and utilization of a simultaneous interpretation corpus. The corpus was constructed at the Center for Integrated Acoustic Information Research (CIAIR) of Nagoya University to promote the realization of an environment that supports multilingual communication. The transcribed data amounts to about 1 million words, which places the corpus among the largest simultaneous interpretation corpora in the world. The corpus is annotated with discourse tags and utterance time tags, and software tools for corpus analysis have been developed to support its practical use. The corpus is therefore expected to be useful not only for developing simultaneous interpreting systems but also for constructing a theory of interpreting.

Similar resources

Construction of Chunk-Aligned Bilingual Lecture Corpus for Simultaneous Machine Translation

With the development of speech and language processing, speech translation systems have been developed. These studies target spoken dialogues and employ consecutive interpretation, which uses a sentence as the translation unit. On the other hand, there are a few studies on simultaneous interpreting, and recently, the language resources for promoting simultaneous interpreting r...

Interpreting Unit Segmentation of Conversational Speech in Simultaneous Interpretation Corpus

Speech-to-speech translation is becoming an important research topic as speech and language processing technology progresses. For an efficient and smooth cross-lingual conversation, the simultaneity of the translation processing has a great influence on the performance of the system. This paper describes interpreting unit segmentation of conversational...

Bilingual Spoken Monologue Corpus for Simultaneous Machine Interpretation Research

This paper describes a large-scale bilingual corpus of spoken monologues and their simultaneous interpretation, which has been constructed at CIAIR. The corpus has the following characteristics: (1) English and Japanese speeches are recorded in parallel, (2) the data contains monologue speeches such as lectures and self-introductions, and (3) the exact beginning and ending times are prov...

Collection of Simultaneous Interpreting Patterns by Using Bilingual Spoken Monologue Corpus

This paper investigates simultaneous interpreting patterns using a bilingual spoken monologue corpus. A total of 4,578 pairs of English-Japanese aligned utterances from the CIAIR simultaneous interpretation database were used. This is the largest-scale investigation of simultaneous interpreting speech. The simultaneous interpreters are required to generate the target speech si...

Spoken language corpus for machine interpretation research

This paper describes a database of speech and language data that we are currently constructing for research on machine interpretation. The database contains bilingual data of lectures and dialogues. We have collected about 72 hours of speech in total and transcribed it manually. We have investigated the database in order to acquire empirical knowledg...

Publication date: 2005